SB_Biostats_Midterm_2023: Understanding Morphologies of Different Cicada Species

Shiraz Belblidia

2023-10-17

Background

Cicadas are a superfamily of insects that are noteable for their cyclic emergence and distinct mating calls. Cicada species found in Eastern North America emerge in intervals of 13 to 17 years, depending on the species. Other species of cicada found in other parts of the world emerge on an annual basis.

A Brood X cicada lounges on a leaf.

Background

In recent years, emergence has been found to be ahead of schedule, suggesting that environmental factors that may signal for emergence are changing, possibly due to a warming climate.

Cicada Data Set

The chosen dataset contains wing and body measurements of 3 distinct cicada species:

These species are annual cicadas typically found in Eastern and Southeastern Asia. The three species in this dataset share geographic similarity, so I am interested to see any correlations that can be found in their morphology.

The data was accessed from the following site: https://datadryad.org/stash/dataset/doi:10.5061/dryad.4xgxd25bg

Why did I choose a cicada data set?

I don’t study cicadas in my academic research. So why did I pick this subject?

Firstly, the dataset was appealing from a data analysis perspective. There are lots of things I can do to rearrange the data to become useful.

Secondly, this dataset intrigued me because I had previously only known about periodical cicadas (I lived in the DC area in 2021, and experienced the emergence of the Great Eastern Brood).

And thirdly, and quite simply…

Why did I choose a cicada data set?

Reading in my data

The first step after uploading the dataset to my github repository as a .csv file was to read it into RStudio.

#read in data as a csv file
cicada_data <- read.csv("https://raw.githubusercontent.com/sbelblidia/Biostats_2023/main/CicadaData.csv")

Parsing out the “Code” column

Column 2 is a list of codes assigned to each observation. Upon reading the ReadMe file that was included with the dataset, I learned that the code corresponds to several types of information that I could better understand if they were separated into their own variables. See ReadMe excerpt below:

The second column, “Code”, is used to show the individual wing information. X-XXX-XXX

Parsing out the “Code” column

Using tidyr, I separated the “Code” column into 3 separate columns: “species”, “individual”, and “sex_wing”.

#install and load tidyr
install.packages("tidyr")
## 
## The downloaded binary packages are in
##  /var/folders/gr/7y000rm51gb3d4_w15tq0x6h0000gn/T//RtmpwUjX0c/downloaded_packages
library(tidyr)

#separate "code" into new variables
cicada_data <- separate(cicada_data, col = Code, into = c("species", "individual", "sex_wing"), sep = "-")

Parsing out the “Code” column

I then further separated “sex_wing” into “sex”, “fore_or_hind”, and “left_or_right” using dplyr and tidyr.

#install and load dplyr
install.packages("dplyr")
## 
## The downloaded binary packages are in
##  /var/folders/gr/7y000rm51gb3d4_w15tq0x6h0000gn/T//RtmpwUjX0c/downloaded_packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#separating the first, second, and third digits in "sex_wing" by hyphens
cicada_data <- cicada_data %>% 
  mutate(sex_wing = gsub("(\\d)(\\d)(\\d)", "\\1-\\2-\\3", sex_wing))

#separate "sex_wing" into new variables
library(tidyr)
cicada_data <- separate(cicada_data, col = sex_wing, into = c("sex", "fore_or_hind", "left_or_right"), sep = "-")

Reformatting “species” column

The “species” column has numbers assigned to each species. I wanted to change them to the actual species name so that each observation is more easily distinguishable. From the ReadMe file, the assignments are as follows:

#use ifelse to reassign the values in the column to match the species
cicada_data$species <- ifelse(
  cicada_data$species == "1", "CA",
  ifelse(cicada_data$species == "2", "MM",
  ifelse(cicada_data$species == "3", "PK",
  cicada_data$species)))

Reformatting sex and wing columns

I ran a similar code chunk to clearly label the sex column as male or female, and the two wing type columns as fore/hind and left/right. This was to help me easily create subsets that I wanted to look at in my data visualizations.

#use ifelse to reassign the values in the column to match what was in the readme
cicada_data$sex <- ifelse(
  cicada_data$sex == "1", "m",
  ifelse(cicada_data$sex == "2", "f",
  cicada_data$sex))
cicada_data$fore_or_hind <- ifelse(
  cicada_data$fore_or_hind == "1", "fore",
  ifelse(cicada_data$fore_or_hind == "2", "hind",
  cicada_data$fore_or_hind))
cicada_data$left_or_right <- ifelse(
  cicada_data$left_or_right == "1", "left",
  ifelse(cicada_data$left_or_right == "2", "right",
  cicada_data$left_or_right))

Renaming variables

With 5 new variables to analyze from the original “Code” variable, I wanted to finish formatting the other variable names to ones that I want to work with. This step would make coding more intuitive for me later on in my data analysis.

#use rename() to rename columns
library(dplyr)
cicada_data <- rename(cicada_data, "date" = Time, "data_points" = N, "wing_length" = WL, "wing_width" = WW, "wing_area" = WA, "body_mass" = BM, "body_length" = BL, "eye_dist" = ED)

#check column names
names(cicada_data)
##  [1] "date"          "species"       "individual"    "sex"          
##  [5] "fore_or_hind"  "left_or_right" "data_points"   "wing_length"  
##  [9] "wing_width"    "wing_area"     "body_mass"     "body_length"  
## [13] "eye_dist"

Removing “data points” column

The “data points” column was unnecessary for my analysis. It was used by the original researchers to calculate wing area for each sample. Because wing area is included in the data set, I removed “data points” from the set using indexing.

#remove the 7th column from the dataset
cicada_data <- cicada_data[,-7]
#check the dataset to make sure the correct column was removed
head(cicada_data)
##       date species individual sex fore_or_hind left_or_right wing_length
## 1 2021-7-7      CA        001   m         fore          left      5.3593
## 2 2021-7-7      CA        001   m         fore         right      5.2168
## 3 2021-7-7      CA        001   m         hind          left      2.8250
## 4 2021-7-7      CA        001   m         hind         right      2.7846
## 5 2021-7-7      CA        002   f         fore          left      5.6031
## 6 2021-7-7      CA        002   f         fore         right      5.4242
##   wing_width wing_area body_mass body_length eye_dist
## 1     1.7470    6.4620    3.2343         4.4      1.7
## 2     1.7123    6.2323    3.2343         4.4      1.7
## 3     1.4106    3.0201    3.2343         4.4      1.7
## 4     1.3988    2.9802    3.2343         4.4      1.7
## 5     1.7619    6.6530    3.4633         4.3      1.8
## 6     1.7760    6.6407    3.4633         4.3      1.8

Checking my data types

I wanted to make sure that the necessary variables are stored as the data type that I needed. “species”, “sex”, “fore_or_hind”, and “left_or_right” should be in factor format for future filtering steps. All body and wing measurements should be numeric.

#use class() to check that measurements are stored as numeric
class(cicada_data$wing_area)
## [1] "numeric"
#check non-numeric data
class(cicada_data$species)
## [1] "character"

“species”, “sex”, “fore_or_hind”, and “left_to_right” had been stored as characters. I converted them to factors.

#use factor() to change the needed data types
cicada_data$species <- factor(cicada_data$species)
cicada_data$sex <- factor(cicada_data$sex)
cicada_data$fore_or_hind <- factor(cicada_data$fore_or_hind)
cicada_data$left_or_right <- factor(cicada_data$left_or_right)

Hypothesis # 1: Correlation between body size and loudness

I first wanted to look into the data to find correlations between the morphologies of the different species and sexes and the ability to produce sound. Male cicadas create their calling sound by expanding and contracting an organ in their abdomen called the tymbal.

The following are the calling sounds of species of the same genus as those in my dataset:

Cryptotympana

Meimuna

Platypleura

I hypothesize, based on the volume and complexity of the calling sounds of the different genera, that Meimuna mongolica will have the largest abdomen and therefore the largest body measurements. Conversely, I hypothesize that Platypleura kaempferi will have the smallest body measurements.

Hypothesis # 1: Correlation between body size and loudness

First, I created a new variable for the ratio between body mass and body length.

#create a new variable for the ratio between body mass and length
library(dplyr)
cicada_data <- cicada_data %>% mutate(body_mass/body_length)

Hypothesis # 1: Correlation between body size and loudness

Then, I plotted the full dataset on histograms to look at body mass, body length, and mass/length.

#make histograms for body mass, body length, and mass:length ratio
par(mfrow = c(1,3))
hist(cicada_data$body_mass, main = "Body Mass", col = "red")
hist(cicada_data$body_length, main = "Body Length", col = "blue")
hist(cicada_data$`body_mass/body_length`, main = "Mass/Length", col = "green")

Hypothesis # 1: Correlation between body size and loudness

The data looked skewed, and I suspected it was due to the three species being lumped together. Using QQPlot, I verified that my data was skewed.

#put 3 graphs next to each other
par(mfrow = c(1,3))

#qqplot for body mass
qqnorm(cicada_data$body_mass, main = "Body Mass")
#add qqline for reference
qqline(cicada_data$body_mass)

#qqplot for body length
qqnorm(cicada_data$body_length, main = "Body Length")
#add qqline for reference
qqline(cicada_data$body_length)

#qqplot for mass/length
qqnorm(cicada_data$`body_mass/body_length`, main = "Mass/Length")
#add qqline for reference
qqline(cicada_data$`body_mass/body_length`)

Hypothesis # 1: Correlation between body size and loudness

So I filtered the data by species and then created separate histograms.

#filter dataset by species
ca_cicadas <- filter(cicada_data, species == "CA")
mm_cicadas <- filter(cicada_data, species == "MM")
pk_cicadas <- filter(cicada_data, species == "PK")

#make a 3x3 grid of graphs
par(mfrow = c(3,3))

#make species-specific histograms
hist(ca_cicadas$body_mass, main = "CA", xlab = "mass (g)", ylab = "Freq.", col = "red")
hist(mm_cicadas$body_mass, main = "MM", xlab = "mass (g)", ylab = "Freq.", col = "red")
hist(pk_cicadas$body_mass, main = "PK", xlab = "mass (g)", ylab = "Freq.", col = "red")

hist(ca_cicadas$body_length, main = "", xlab = "length (cm)", ylab = "Freq.", col = "blue")
hist(mm_cicadas$body_length, main = "", xlab = "length (cm)", ylab = "Freq.", col = "blue")
hist(pk_cicadas$body_length, main = "", xlab = "length (cm)", ylab = "Freq.", col = "blue")

hist(ca_cicadas$`body_mass/body_length`, main = "", xlab = "mass/length (g/cm)", ylab = "Freq.", col = "green")
hist(mm_cicadas$`body_mass/body_length`, main = "", xlab = "mass/length (g/cm)", ylab = "Freq.", col = "green")
hist(pk_cicadas$`body_mass/body_length`, main = "", xlab = "mass/length (g/cm)", ylab = "Freq.", col = "green")

Hypothesis # 1: Correlation between body size and loudness

Sure enought, qqplots of the species-specific datasets yielded more normal distributions.

#make a 3x3 grid of plots
par(mfrow = c(3,3))

#species-specific qqplots for body mass

qqnorm(ca_cicadas$body_mass, main = "CA Body Mass")
#add qqline for reference
qqline(ca_cicadas$body_mass)

qqnorm(mm_cicadas$body_mass, main = "MM Body Mass")
#add qqline for reference
qqline(mm_cicadas$body_mass)

qqnorm(pk_cicadas$body_mass, main = "PK Body Mass")
#add qqline for reference
qqline(pk_cicadas$body_mass)


#species-specific qqplot for body length

qqnorm(ca_cicadas$body_length, main = "CA Body Length")
#add qqline for reference
qqline(ca_cicadas$body_length)

qqnorm(mm_cicadas$body_length, main = "MM Body Length")
#add qqline for reference
qqline(mm_cicadas$body_length)

qqnorm(pk_cicadas$body_length, main = "PK Body Length")
#add qqline for reference
qqline(pk_cicadas$body_length)


#species-specific qqplot for mass/length

qqnorm(ca_cicadas$`body_mass/body_length`, main = "CA Mass/Length")
#add qqline for reference
qqline(ca_cicadas$`body_mass/body_length`)

qqnorm(mm_cicadas$`body_mass/body_length`, main = "MM Mass/Length")
#add qqline for reference
qqline(mm_cicadas$`body_mass/body_length`)

qqnorm(pk_cicadas$`body_mass/body_length`, main = "PK Mass/Length")
#add qqline for reference
qqline(pk_cicadas$`body_mass/body_length`)

Hypothesis # 1: Correlation between body size and loudness

Calculating the mean and median values for each variable:

#find the mean body mass of each species
ca_mean_mass <- mean(ca_cicadas$body_mass)
mm_mean_mass <- mean(mm_cicadas$body_mass)
pk_mean_mass <- mean(pk_cicadas$body_mass)

#print mean values
cat("CA Mean Mass:", ca_mean_mass, "\n")
## CA Mean Mass: 3.176041
cat("MM Mean Mass:", mm_mean_mass, "\n")
## MM Mean Mass: 0.7894469
cat("PK Mean Mass:", pk_mean_mass, "\n")
## PK Mean Mass: 0.6080031
#find the median body mass of each species
ca_med_mass <- median(ca_cicadas$body_mass)
mm_med_mass <- median(mm_cicadas$body_mass)
pk_med_mass <- median(pk_cicadas$body_mass)

#print median values
cat("CA Median Mass:", ca_med_mass, "\n")
## CA Median Mass: 3.22225
cat("MM Median Mass:", mm_med_mass, "\n")
## MM Median Mass: 0.7997
cat("PK Median Mass:", pk_med_mass, "\n")
## PK Median Mass: 0.6038
#find the mean body length of each species
ca_mean_length <- mean(ca_cicadas$body_length)
mm_mean_length <- mean(mm_cicadas$body_length)
pk_mean_length <- mean(pk_cicadas$body_length)

#print the mean values
cat("CA Mean Length:", ca_mean_length, "\n")
## CA Mean Length: 4.240244
cat("MM Mean Length:", mm_mean_length, "\n")
## MM Mean Length: 2.95144
cat("PK Mean Length:", pk_mean_length, "\n")
## PK Mean Length: 2.182812
#find the median body length of each species
ca_med_length <- median(ca_cicadas$body_length)
mm_med_length <- median(mm_cicadas$body_length)
pk_med_length <- median(pk_cicadas$body_length)

#print the median values
cat("CA Median Length:", ca_med_length, "\n")
## CA Median Length: 4.3
cat("MM Median Length:", mm_med_length, "\n")
## MM Median Length: 3
cat("PK Median Length:", pk_med_length, "\n")
## PK Median Length: 2.2
#find the mean mass/length ratio of each species
ca_mean_ml_ratio <- mean(ca_cicadas$`body_mass/body_length`)
mm_mean_ml_ratio <- mean(mm_cicadas$`body_mass/body_length`)
pk_mean_ml_ratio <- mean(pk_cicadas$`body_mass/body_length`)

#print the mean values
cat("CA Mean Mass/Length:", ca_mean_ml_ratio, "\n")
## CA Mean Mass/Length: 0.7472408
cat("MM Mean Mass/Length:", mm_mean_ml_ratio, "\n")
## MM Mean Mass/Length: 0.2672061
cat("PK Mean Mass/Length:", pk_mean_ml_ratio, "\n")
## PK Mean Mass/Length: 0.2782034
#find the median mass/length ratio of each species
ca_med_ml_ratio <- median(ca_cicadas$`body_mass/body_length`)
mm_med_ml_ratio <- median(mm_cicadas$`body_mass/body_length`)
pk_med_ml_ratio <- median(pk_cicadas$`body_mass/body_length`)

#print the median values
cat("CA Median Mass/Length:", ca_med_ml_ratio, "\n")
## CA Median Mass/Length: 0.7460263
cat("MM Median Mass/Length:", mm_med_ml_ratio, "\n")
## MM Median Mass/Length: 0.2697586
cat("PK Median Mass/Length:", pk_med_ml_ratio, "\n")
## PK Median Mass/Length: 0.2765

Hypothesis # 2: Correlation between body size and sex

Because the tymbal is found in the abdomen of the male cicada, I hypothesize that the abdomens of female cicadas will, on average, be smaller than that of their male counterparts. Thus, I hypothesize that female cicadas will have smaller body measurements than males.

Hypothesis # 2: Correlation between body size and sex

Plotting body mass, body length, and mass/length ratio by sex showed many outliers.

install.packages("ggplot2")
## 
## The downloaded binary packages are in
##  /var/folders/gr/7y000rm51gb3d4_w15tq0x6h0000gn/T//RtmpwUjX0c/downloaded_packages
library(ggplot2)

install.packages("ggpubr")
## 
## The downloaded binary packages are in
##  /var/folders/gr/7y000rm51gb3d4_w15tq0x6h0000gn/T//RtmpwUjX0c/downloaded_packages
library(ggpubr)

ggplot(cicada_data, aes(x = sex, y = body_mass, fill = sex)) +
  geom_boxplot(notch = TRUE)

ggplot(cicada_data, aes(x = sex, y = body_length, fill = sex)) +
  geom_boxplot(notch = TRUE)

ggplot(cicada_data, aes(x = sex, y = body_mass/body_length, fill = sex)) +
  geom_boxplot(notch = TRUE)

Hypothesis # 2: Correlation between body size and sex

Separating the plots by species:

ggplot(ca_cicadas, aes(x = sex, y = body_mass, fill = sex)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Cryptotympana atrata") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(mm_cicadas, aes(x = sex, y = body_mass, fill = sex)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Meimuna mongolica") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(pk_cicadas, aes(x = sex, y = body_mass, fill = sex)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Platypleura kaempferi") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(ca_cicadas, aes(x = sex, y = body_length, fill = sex)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Cryptotympana atrata") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(mm_cicadas, aes(x = sex, y = body_length, fill = sex)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Meimuna mongolica") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(pk_cicadas, aes(x = sex, y = body_length, fill = sex)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Platypleura kaempferi") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(ca_cicadas, aes(x = sex, y = body_mass/body_length, fill = sex)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Cryptotympana atrata") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(ca_cicadas, aes(x = sex, y = body_mass/body_length, fill = sex)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Meimuna mongolica") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(ca_cicadas, aes(x = sex, y = body_mass/body_length, fill = sex)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Platypleura kaempferi") +
  theme(plot.title = element_text(hjust = 0.5))

Hypothesis # 3: Correlation between wing size and sex

I then wanted to look at the wing measurements gathered in the dataset to understand a little more about how wing size differs between sexes. In many other animal species (such as birds), wings are larger and more ornate so as to attract their mate.

I hypothesize that males will have larger wings than their female counterparts.

Hypothesis # 3: Correlation between wing size and sex

Plotting wing sizes showed that the data points clustered into distinct groups, but sex did not seem to be a factor.

#make a scatter plot of wing length vs wing width, with small translucent dots, categorized. by sex
ggplot(cicada_data, aes(x = wing_length, y = wing_width, color = sex)) +
  geom_point(size = 0.7, alpha = 0.5) +
  labs(title = "Wing Sizes by Sex", x = "wing length (cm)", y = "wing windth (cm)") +
  theme(plot.title = element_text(hjust = 0.5))

Hypothesis # 3: Correlation between wing size and sex

Distinguishing between fore/hind wings and right/left wings showed some clustering patterns for fore/hind, but not left/right.

ggplot(cicada_data, aes(x = wing_length, y = wing_width, color = fore_or_hind, shape = left_or_right)) +
  geom_point(size = 1, alpha = 0.5) +
  labs(title = "Wing Sizes by Anatomy", x = "wing length (cm)", y = "wing windth (cm)") +
  theme(plot.title = element_text(hjust = 0.5))

Hypothesis # 3: Correlation between wing size and sex

Finally, separating the sizes by species showed that wing size was indeed dependent on both species and whether the wing is a forewing or a hindwing.

ggplot(cicada_data, aes(x = wing_length, y = wing_width, color = species, shape = fore_or_hind)) +
  geom_point(size = 1, alpha = 0.5) +
  labs(title = "Wing Sizes by Species", x = "wing length (cm)", y = "wing windth (cm)") +
  theme(plot.title = element_text(hjust = 0.5))

#running anova
aov(wing_length ~ wing_width, data = cicada_data)
## Call:
##    aov(formula = wing_length ~ wing_width, data = cicada_data)
## 
## Terms:
##                 wing_width Residuals
## Sum of Squares   1710.0701  469.0464
## Deg. of Freedom          1      1810
## 
## Residual standard error: 0.5090596
## Estimated effects may be unbalanced

Conclusions

Hypothesis #1

Based on the mean and median values of body mass, body length, and the ratio of body mass to body length, Cryptotympana atrata was the largest species of the data set. Platypleura kaempferi is the smallest. Therefore, the first half of my first hypothesis was incorrect, while the second half was correct.

Hypothesis #2

Based on the box plots of each species, female cicadas across the board were larger than males. Therefore, my second hypothesis was incorrect.

Hypothesis #3

Wing size clustered by species and forewing/hindwing. There was no discernable difference due to sex. Therefore, my third hypothesis was incorrect.

Future directions

Interesting future directions for this dataset could be: